Solar power is a central component of the transition to low-carbon energy systems. However, solar generation is highly variable: it depends not only on the time of day and season, but also on short-term weather conditions such as cloud cover, humidity, and temperature. For grid operators and planners, it is important to understand both how weather drives solar output and how uncertain that relationship is.
In this project, I analyze a real-world dataset that combines solar energy production with local weather measurements over time. The goal is to build a hierarchical Bayesian regression model that quantifies how irradiance and weather conditions drive solar output during daylight hours, allows baseline production to vary by month through partially pooled intercepts, and yields full posterior predictive distributions for production under specified weather scenarios.
This work serves as the final project for a graduate course in Bayesian computational statistics. The focus is not only on obtaining point estimates, but on modeling and communicating uncertainty in a principled way.
The dataset, stored as solar_weather.csv, contains time-stamped records of:
- Energy.delta.Wh.: energy produced during each time interval (watt-hours). In the analysis, this variable is renamed to energy_wh for convenience.
- GHI: global horizontal irradiance.
- isSun: indicator for whether the sun is above the horizon (0/1).
- sunlightTime, dayLength, SunlightTime.daylength: measures of available sunlight.
- temp: temperature (°C).
- pressure: air pressure.
- humidity: relative humidity (%).
- wind_speed: wind speed.
- rain_1h, snow_1h: precipitation indicators.
- clouds_all: cloud cover (%).
- weather_type: categorical description of conditions.
- Time: timestamp.
- hour: hour of day.
- month: month of year (1–12).
Each row corresponds to one time interval (here, 15 minutes). At night, isSun and GHI are zero, and solar energy production is essentially zero. During daylight hours, we see positive irradiance and varying levels of solar output.
This project focuses on the following questions:
How do irradiance and weather conditions affect solar energy output during daylight hours? For example, how much additional energy can we expect when GHI increases by one standard deviation, after controlling for temperature and cloud cover?
How do these relationships vary by month? Even under similar instantaneous weather conditions, January and July differ in sun angle and typical atmospheric conditions. A hierarchical model that allows the baseline log-energy to vary by month can capture this.
What does the model predict for solar production under specific future weather scenarios, and how uncertain are those predictions? Instead of a single forecast, a Bayesian model provides a posterior predictive distribution describing a range of plausible outcomes.
A standard regression model could estimate the relationship between log-energy and weather variables by pooling all observations together. However, this ignores systematic seasonal differences. A fully separate model for each month would overfit and fail to share information.
A hierarchical Bayesian model offers a compromise: month-specific intercepts are drawn from a common distribution, so each month can have its own baseline while still borrowing strength from the other months (partial pooling).
This structure is well aligned with the Bayesian perspective discussed in Bayesian Data Analysis (Gelman et al., 2013): we encode exchangeability across months and estimate the degree of variation between them instead of assuming they are either identical or completely unrelated.
The remainder of this report is organized as follows: we first load and prepare the data, explore it visually, specify the hierarchical model and its priors, fit it with brms and check MCMC diagnostics, interpret the posterior estimates and month-level effects, run posterior predictive checks and an example scenario prediction, and close with conclusions, limitations, and directions for future work.
We begin by loading the CSV file and inspecting its basic structure.
library(tidyverse)
solar <- read.csv("solar_weather.csv")
solar <- solar %>%
rename(
energy_wh = Energy.delta.Wh.
)
glimpse(solar)
## Rows: 196,776
## Columns: 17
## $ Time <chr> "2017-01-01 00:00:00", "2017-01-01 00:15:00", "…
## $ energy_wh <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GHI <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ temp <dbl> 1.6, 1.6, 1.6, 1.6, 1.7, 1.7, 1.7, 1.7, 1.9, 1.…
## $ pressure <int> 1021, 1021, 1021, 1021, 1020, 1020, 1020, 1020,…
## $ humidity <int> 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
## $ wind_speed <dbl> 4.9, 4.9, 4.9, 4.9, 5.2, 5.2, 5.2, 5.2, 5.5, 5.…
## $ rain_1h <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ snow_1h <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ clouds_all <int> 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
## $ isSun <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sunlightTime <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ dayLength <int> 450, 450, 450, 450, 450, 450, 450, 450, 450, 45…
## $ SunlightTime.daylength <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weather_type <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ hour <int> 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3,…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
summary(solar$energy_wh)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 573 577 5020
summary(solar$GHI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 1.6 32.6 46.8 229.2
summary(solar$temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -16.600 3.600 9.300 9.791 15.700 35.800
The raw dataset includes both night-time and daytime observations. At
night, isSun and GHI are zero, and energy
production is essentially zero. Since our goal is to model the
relationship between weather and solar production, we focus on daylight
intervals with positive irradiance and positive energy output.
solar_day <- solar %>%
filter(
isSun == 1,
GHI > 0,
energy_wh > 0
)
nrow(solar); nrow(solar_day)
## [1] 196776
## [1] 95790
summary(solar_day$energy_wh)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 179 617 1177 1969 5020
This filter removes night-time intervals and any records where irradiance or energy output is zero. The remaining data represent times when the system is actively producing power.
The distribution of raw energy values is right-skewed. To stabilize variance and approximate normality, we model the logarithm of energy:
\[ y_i = \log(\text{energy\_wh}_i). \]
We also standardize key continuous predictors (subtract mean and divide by standard deviation) to improve MCMC sampling and interpretability of coefficients.
solar_day <- solar_day %>%
mutate(
log_energy = log(energy_wh),
GHI_scaled = as.numeric(scale(GHI)),
temp_scaled = as.numeric(scale(temp)),
clouds_scaled = as.numeric(scale(clouds_all)),
month_factor = factor(month)
)
summary(solar_day$log_energy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.187 6.425 6.240 7.585 8.521
table(solar_day$month_factor)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 4842 6133 8446 9759 11297 11546 11678 10414 7263 6014 4556 3842
ggplot(solar_day, aes(x = log_energy)) +
geom_histogram(bins = 40, color = "black") +
labs(
title = "Distribution of log(energy_wh)",
x = "log(energy_wh)",
y = "Count"
)
The log-transformed energy is much closer to symmetric, which is more compatible with a Gaussian likelihood.
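To quantify this visual impression, a quick numerical check can be added (a small sketch; the skew() helper below is defined inline rather than taken from any package used in this report):
# Rough moment-based skewness of raw vs. log-transformed energy
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(
  raw_energy = skew(solar_day$energy_wh),
  log_energy = skew(solar_day$log_energy)
)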
ggplot(solar_day, aes(x = GHI, y = energy_wh)) +
geom_point(alpha = 0.2) +
scale_y_log10() +
labs(
title = "Energy vs GHI (log scale on y)",
x = "GHI",
y = "energy_wh (log10 scale)"
)
There is a strong positive association between irradiance and energy production, as expected.
ggplot(solar_day, aes(x = month_factor, y = log_energy)) +
geom_boxplot() +
labs(
title = "log(energy_wh) by Month",
x = "Month",
y = "log(energy_wh)"
)
The boxplots suggest that some months (for example, summer vs. winter) have systematically higher or lower log-energy values, even after filtering to daylight. This visual evidence supports the decision to include month-level random intercepts in a hierarchical model.
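A complementary numeric summary (a small addition, not part of the original analysis) is the per-month mean and spread of log-energy together with the sample sizes, which previews what the hierarchical intercepts should pick up:
# Mean, spread, and sample size of log-energy by month
solar_day %>%
  group_by(month_factor) %>%
  summarise(
    mean_log_energy = mean(log_energy),
    sd_log_energy = sd(log_energy),
    n = n()
  )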
We model the log-energy for each daylight interval \(i\) as:
\[ y_i = \log(\text{energy\_wh}_i) \sim \text{Normal}(\mu_i, \sigma), \]
with linear predictor
\[ \mu_i = \alpha_{j(i)} + \beta_{\text{GHI}} \,\text{GHI\_scaled}_i + \beta_{\text{temp}} \,\text{temp\_scaled}_i + \beta_{\text{clouds}} \,\text{clouds\_scaled}_i, \]
where \(\alpha_{j(i)}\) is the intercept for the month \(j(i)\) to which observation \(i\) belongs, \(\beta_{\text{GHI}}\), \(\beta_{\text{temp}}\), and \(\beta_{\text{clouds}}\) are global regression coefficients shared across months, and \(\sigma\) is the residual standard deviation on the log scale.
We place hierarchical priors on the month-level intercepts:
\[ \alpha_j \sim \text{Normal}(\mu_\alpha, \tau_\alpha^2), \quad j = 1, \ldots, 12. \]
This expresses the idea that months are exchangeable: before seeing the data, we believe they are similar but not identical. The data determine how much they are allowed to differ.
We use weakly informative priors that are broad enough not to dominate the likelihood, but still regularize extreme values: a Normal(0, 5) prior on the population-level intercept \(\mu_\alpha\), Normal(0, 1) priors on the regression coefficients, and half-Student-t(4, 0, 1) priors on both the residual standard deviation \(\sigma\) and the between-month standard deviation \(\tau_\alpha\).
Because the predictors are standardized, a Normal(0, 1) prior on the regression coefficients corresponds roughly to believing that, a priori, a one-standard-deviation change in a predictor is unlikely to shift log-energy by more than a few units.
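Written out, the prior specification used in the fitting code below is (the plus superscript indicates that the Student-t priors on scale parameters are truncated to positive values, as brms does automatically):
\[ \mu_\alpha \sim \text{Normal}(0, 5), \qquad \beta_{\text{GHI}}, \beta_{\text{temp}}, \beta_{\text{clouds}} \sim \text{Normal}(0, 1), \qquad \sigma, \tau_\alpha \sim \text{Student-}t^{+}(4, 0, 1). \]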
We fit the model using the brms package, which provides
a high-level interface to Stan.
library(cmdstanr)
set_cmdstan_path()
options(brms.backend = "cmdstanr")
library(brms)
priors <- c(
prior(normal(0, 5), class = "Intercept"),
prior(normal(0, 1), class = "b"),
prior(student_t(4, 0, 1), class = "sigma"),
prior(student_t(4, 0, 1), class = "sd") # random effects SD
)
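Before fitting, it can be helpful to confirm which parameter classes the model expects priors for; get_prior() lists the default priors and the classes that the custom priors above must match (shown here as a check, using the same formula and data):
# List prior classes (Intercept, b, sd, sigma) for this formula and data
get_prior(
  log_energy ~ GHI_scaled + temp_scaled + clouds_scaled + (1 | month_factor),
  data = solar_day,
  family = gaussian()
)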
fit_hier <- brm(
formula = log_energy ~ GHI_scaled + temp_scaled + clouds_scaled + (1 | month_factor),
data = solar_day,
family = gaussian(),
prior = priors,
chains = 4,
iter = 4000,
warmup = 1000,
cores = 4,
seed = 2025,
control = list(adapt_delta = 0.95)
)
## Running MCMC with 4 parallel chains...
## (per-iteration progress output omitted)
## Chain 3 finished in 3875.7 seconds.
## Chain 2 finished in 4042.5 seconds.
## Chain 1 finished in 4374.5 seconds.
## Chain 4 finished in 4984.6 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 4319.3 seconds.
## Total execution time: 4985.1 seconds.
We check standard MCMC diagnostics to assess convergence:
summary(fit_hier)$fixed
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept 6.25462118 0.083449494 6.08731766 6.41905992 1.002032 1648.328
## GHI_scaled 1.35303200 0.003489750 1.34622030 1.35988964 1.000102 8811.254
## temp_scaled -0.10529976 0.005219221 -0.11540306 -0.09487033 1.000802 7861.817
## clouds_scaled 0.05989996 0.003057899 0.05400022 0.06581612 1.000184 9693.994
## Tail_ESS
## Intercept 2471.083
## GHI_scaled 7756.335
## temp_scaled 7522.462
## clouds_scaled 6721.690
summary(fit_hier)$random
## $month_factor
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sd(Intercept) 0.2849652 0.06793572 0.1883176 0.4502614 1.000588 1902.399
## Tail_ESS
## sd(Intercept) 3011.343
plot(fit_hier) # trace plots and densities
bayesplot::mcmc_rhat(rhat(fit_hier))
We look for Rhat values close to 1 for all parameters, bulk and tail effective sample sizes large enough for stable estimates, well-mixed trace plots, and an absence of divergent transitions.
If diagnostics were poor, we would consider reparameterizing,
increasing iter, or adjusting adapt_delta. In
the runs for this project, the diagnostics were satisfactory, suggesting
that the posterior has been explored adequately.
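One further check worth running (a small sketch beyond the diagnostics shown above) is a count of divergent transitions, since divergences can bias posteriors in hierarchical models even when Rhat and ESS look fine:
# Count divergent transitions across all post-warmup draws
np <- nuts_params(fit_hier)
sum(subset(np, Parameter == "divergent__")$Value)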
We first examine the posterior summaries for the global regression coefficients.
fixef(fit_hier)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 6.25462118 0.083449494 6.08731766 6.41905992
## GHI_scaled 1.35303200 0.003489750 1.34622030 1.35988964
## temp_scaled -0.10529976 0.005219221 -0.11540306 -0.09487033
## clouds_scaled 0.05989996 0.003057899 0.05400022 0.06581612
The fitted coefficients above line up with expectations for irradiance and temperature, though not for cloud cover:
- GHI_scaled: a large positive coefficient (posterior mean ≈ 1.35) with a very narrow credible interval.
- temp_scaled: a small negative coefficient (posterior mean ≈ -0.11), with a 95% interval excluding zero.
- clouds_scaled: a small positive coefficient (posterior mean ≈ 0.06), rather than the negative effect one might expect a priori.
Because we standardized the predictors, the coefficient for GHI_scaled can be interpreted as:
The expected change in log-energy associated with a one-standard-deviation increase in GHI, holding temperature and cloud cover constant.
Exponentiating the coefficient gives a multiplicative factor on energy itself. Here the posterior mean is
\[ \beta_{\text{GHI}} \approx 1.35, \]
so
\[ \exp(1.35) \approx 3.86, \]
meaning that a one-standard-deviation increase in GHI multiplies expected energy by a factor of roughly 3.9, all else equal.
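Rather than exponentiating only the posterior mean, the full posterior of this multiplicative effect can be summarized from the draws (a short sketch; b_GHI_scaled is the name brms assigns to this coefficient):
# Posterior median and 95% interval for the multiplicative effect of +1 SD of GHI
draws <- as_draws_df(fit_hier)
quantile(exp(draws$b_GHI_scaled), probs = c(0.025, 0.5, 0.975))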
For cloud cover, the sign deserves comment. One might expect a negative coefficient, with increased cloudiness credibly associated with lower output after adjusting for irradiance and temperature. In this fit, however, the 95% credible interval for clouds_scaled lies slightly above zero, which suggests that GHI already captures most of the cloud-related loss of sunlight; the conditional effect of cloud cover given GHI need not share the sign of its marginal effect.
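The posterior probability that this coefficient is negative can be computed directly with brms' hypothesis() utility (a short sketch):
# Evidence ratio and posterior probability that the cloud coefficient is negative
hypothesis(fit_hier, "clouds_scaled < 0")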
We next look at the month-level random intercepts:
ranef(fit_hier)$month_factor
## , , Intercept
##
## Estimate Est.Error Q2.5 Q97.5
## 1 -0.16957028 0.08450064 -0.33531960 0.001369822
## 2 0.31068395 0.08416339 0.14467172 0.482170484
## 3 0.22333723 0.08398299 0.05829526 0.393368625
## 4 -0.01728495 0.08380422 -0.18357309 0.151793372
## 5 -0.21782349 0.08379834 -0.38414588 -0.049732464
## 6 -0.30395514 0.08396682 -0.47049822 -0.137400302
## 7 -0.12426871 0.08390127 -0.28995853 0.043914281
## 8 0.01648338 0.08410347 -0.14791470 0.185749931
## 9 0.30004770 0.08397231 0.13435249 0.470095332
## 10 0.42304975 0.08411682 0.25782863 0.593190008
## 11 -0.08050177 0.08430716 -0.24694560 0.090252931
## 12 -0.33210447 0.08457605 -0.49791616 -0.162368900
We can also visualize them with uncertainty intervals:
# Extract month-level random intercepts and tidy them
month_effects <- ranef(fit_hier)$month_factor[, , "Intercept"] %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "month") %>% # get month from rownames
mutate(
month = as.integer(month) # convert "1","2",... to integers
) %>%
arrange(month)
# Plot with months in the correct numeric order
ggplot(month_effects, aes(x = factor(month, levels = 1:12), y = Estimate)) +
geom_point() +
geom_errorbar(aes(ymin = Q2.5, ymax = Q97.5), width = 0.15) +
labs(
title = "Posterior Month-Level Intercepts (log scale)",
x = "Month",
y = "Month-specific intercept for log(energy_wh)"
)
Interpretation: months with positive intercepts (for example, months 2, 3, 9, and 10) tend to produce more energy than the overall baseline under the same standardized weather conditions, while months with negative intercepts (such as 5, 6, and 12) tend to produce less.
The hierarchical prior shrinks these intercepts toward a common mean, especially for months with little data. This is a core advantage of the hierarchical Bayesian approach.
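To see this shrinkage explicitly, the raw (no-pooling) month deviations in mean log-energy can be set against the partially pooled intercepts; because the raw deviations ignore the covariates, this is only a rough illustration (a sketch reusing month_effects from the plot above):
# Unpooled month deviations from the overall mean vs. partially pooled intercepts
raw_dev <- solar_day %>%
  group_by(month = as.integer(as.character(month_factor))) %>%
  summarise(raw_deviation = mean(log_energy) - mean(solar_day$log_energy)) %>%
  arrange(month)
raw_dev$pooled_intercept <- month_effects$Estimate[match(raw_dev$month, month_effects$month)]
raw_dev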
# Posterior summary of the residual standard deviation sigma
# (sigma() on the brmsfit returned numeric(0), so we read it from the model summary instead)
summary(fit_hier)$spec_pars
The residual standard deviation \(\sigma\) on the log scale tells us how much variability remains after accounting for GHI, temperature, clouds, and month-level effects. A moderate value indicates that the predictors explain a substantial portion of the variation but that inherent stochasticity and unmodeled factors still play a role.
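A related single-number summary (an addition, not in the original write-up) is the Bayesian R-squared, which brms computes from the posterior draws and which indicates the share of variance in log-energy accounted for by the predictors and month effects:
# Posterior summary of Bayesian R^2 for the fitted model
bayes_R2(fit_hier)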
Posterior predictive checks help evaluate whether the model can reproduce key features of the observed data.
pp_check(fit_hier, ndraws = 100)
This command compares the distribution of simulated log-energy values from the posterior predictive distribution against the observed log-energy distribution.
If the model is reasonable, the simulated distributions will broadly overlay the observed distribution. Large discrepancies (for example, the model systematically under-predicting high-energy intervals) would suggest the need for model refinement, such as adding interaction terms or considering non-Gaussian error structures.
Overall, the model captures the main shape and spread of log-energy reasonably well, with some deviations at the extremes that are typical in real-world energy data.
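Because the hierarchy is organized by month, a grouped check is also informative: comparing the observed mean log-energy in each month with the means of the simulated datasets (a sketch using the stat_grouped type of pp_check):
# Observed vs. simulated mean log-energy within each month
pp_check(fit_hier, type = "stat_grouped", stat = "mean",
  group = "month_factor", ndraws = 100)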
One of the advantages of a Bayesian model is that it provides a full posterior predictive distribution for future observations. To illustrate this, we consider a simple scenario:
Predict solar energy output for a daylight interval in July under moderately strong irradiance, warm temperature, and relatively low cloud cover.
Concretely, suppose GHI = 800, temperature = 25°C, cloud cover = 20%, and the month is July.
We translate these into standardized predictors using the means and standard deviations from our training data.
mean_GHI <- mean(solar_day$GHI)
sd_GHI <- sd(solar_day$GHI)
mean_temp <- mean(solar_day$temp)
sd_temp <- sd(solar_day$temp)
mean_cloud <- mean(solar_day$clouds_all)
sd_cloud <- sd(solar_day$clouds_all)
newdata <- tibble(
GHI_scaled = (800 - mean_GHI) / sd_GHI,
temp_scaled = (25 - mean_temp) / sd_temp,
clouds_scaled = (20 - mean_cloud) / sd_cloud,
month_factor = factor(7, levels = levels(solar_day$month_factor)) # July
)
newdata
## # A tibble: 1 × 4
## GHI_scaled temp_scaled clouds_scaled month_factor
## <dbl> <dbl> <dbl> <fct>
## 1 12.7 1.51 -1.24 7
We use posterior_predict() to generate draws from the
predictive distribution of log-energy, then exponentiate to obtain
predictions on the original Wh scale.
set.seed(2025)
log_pred_draws <- posterior_predict(fit_hier, newdata = newdata, ndraws = 2000)
energy_pred_draws <- exp(log_pred_draws[, 1])
quantile(energy_pred_draws, probs = c(0.05, 0.5, 0.95))
## 5% 50% 95%
## 2523837642 10965648090 47434029887
We can also visualize the predictive distribution:
pred_df <- tibble(energy = energy_pred_draws)
ggplot(pred_df, aes(x = energy)) +
geom_histogram(bins = 40, color = "black") +
labs(
title = "Posterior Predictive Distribution of Energy (Example July Scenario)",
x = "Predicted energy_wh",
y = "Count"
)
For this example scenario, we might summarize the result in plain language as:
Given the model and the data, under the assumed July conditions (GHI ≈ 800, temperature 25°C, cloud cover 20%), the expected energy output in one interval is about X Wh, with most plausible values falling between Y Wh and Z Wh.
Here, X is the posterior median and Y–Z is a chosen credible interval, such as 90% or 95%. Rather than providing a single point forecast, we convey an entire distribution of plausible outcomes, which is a key strength of Bayesian predictive analysis. An important caveat is that GHI = 800 lies far above the largest GHI observed in this dataset (about 229), so GHI_scaled ≈ 12.7 is a strong extrapolation; the resulting draws are implausibly large compared with the observed maximum of about 5,020 Wh per interval, and this scenario should be read as an illustration of the predictive machinery rather than a realistic forecast.
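For comparison, a scenario that stays inside the observed range of the predictors (a sketch, using an assumed GHI of 200, near the upper end of the observed values) should produce predictions on a scale much closer to the observed data:
# Same July scenario but with GHI inside the observed range (max ≈ 229)
newdata_inrange <- tibble(
  GHI_scaled = (200 - mean_GHI) / sd_GHI,
  temp_scaled = (25 - mean_temp) / sd_temp,
  clouds_scaled = (20 - mean_cloud) / sd_cloud,
  month_factor = factor(7, levels = levels(solar_day$month_factor))
)
log_pred_inrange <- posterior_predict(fit_hier, newdata = newdata_inrange, ndraws = 2000)
quantile(exp(log_pred_inrange[, 1]), probs = c(0.05, 0.5, 0.95))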
Using a hierarchical Bayesian regression model, we analyzed solar energy production as a function of irradiance and weather variables, with month-level random intercepts. The main findings can be summarized as follows:
Irradiance (GHI) is, as expected, the dominant driver of solar
energy output. The posterior distribution for the coefficient of
GHI_scaled is strongly positive, and its credible interval
lies well above zero. On the original scale, a one-standard-deviation
increase in GHI corresponds to a substantial multiplicative increase in
expected energy production.
Cloud cover plays a more subtle role than expected. After controlling for GHI, the estimated cloud-cover coefficient is small and slightly positive rather than negative, suggesting that measured irradiance already absorbs most of the attenuating effect of clouds; additional cloudiness appears to matter mainly through its effect on GHI itself rather than through a separate, independent penalty.
Temperature has a smaller effect. In this fit, the temperature coefficient is small and negative (posterior mean ≈ -0.11, with a 95% credible interval excluding zero), consistent with the tendency of high panel temperatures to reduce conversion efficiency. Its effect is clearly detectable but far less important than irradiance.
Month-to-month variation is captured by the hierarchical intercepts. Some months have higher baseline log-energy than others even after adjusting for instantaneous weather conditions, reflecting seasonal factors such as sun angle, typical atmospheric conditions, and perhaps operational aspects. The hierarchical prior shrinks these month-level intercepts toward a common mean, especially for months with fewer observations.
The Bayesian hierarchical framework is well-suited to this problem for several reasons:
Partial pooling across months: Rather than treating each month independently or forcing them to be identical, the model allows for structured variation and learns how similar months are to each other.
Uncertainty quantification: The posterior distributions for coefficients, month-level effects, and predictive quantities provide a clear picture of uncertainty. This is more informative than a single point estimate and helps avoid over-confidence in forecasts.
Posterior predictive checks: By simulating from the posterior predictive distribution, we can assess whether the model reproduces key features of the observed data. In this project, the checks show that the model captures the main shape and spread of log-energy reasonably well.
Interpretability: The regression coefficients are directly interpretable, especially on the log-scale (multiplicative effects on energy). The hierarchical structure is naturally explained in terms of seasonality and partial sharing of information.
Overall, Bayesian methods are both appropriate and informative for this type of renewable energy modeling.
Several limitations should be noted:
Modeling only daylight intervals: We restricted the analysis to
times when isSun == 1 and both GHI and energy output are
positive. This is sensible for studying the relationship between weather
and active production, but it excludes transitions at dawn and dusk and
does not model zero-output periods.
Simplified functional form: The model assumes a linear relationship on the log-scale between energy and the standardized predictors. In reality, the relationship between energy and irradiance, temperature, and cloud cover may be nonlinear, and there may be important interactions (for example, temperature moderating the effect of GHI).
Omitted variables: The dataset does not explicitly include all possible factors, such as panel tilt, shading, maintenance events, or grid curtailment. These unobserved influences are absorbed into the residual error term and may limit predictive accuracy.
Time dependence: The model does not explicitly account for autocorrelation over time. Solar production series can exhibit temporal dependence (e.g., cloudy periods vs. clear periods) that a time-series model or Gaussian process might capture more fully.
Future work could extend this project in several directions:
Nonlinear and interaction terms: Incorporating spline terms or interaction effects (for example, GHI × temperature) could provide a more flexible relationship between weather and energy; a brief formula sketch follows this list.
Time-series structure: Adding time-series components or Gaussian process priors could capture temporal dependence and improve short-term forecasting.
More detailed hierarchy: If data from multiple sites (different solar plants or locations) were available, the model could include site-level random effects nested within regions or climatic zones.
Mixture models: A mixture model could distinguish between different operational regimes (e.g., clear sky vs. partly cloudy, or normal operation vs. maintenance events).
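As an illustration of the first two extensions, the brms formula could be adapted along the following lines (a sketch only: the smooth term, interaction, and autoregressive component are supported by brms, but priors and settings would need care, and the model is not fit here):
# Sketch of an extended formula: a smooth in GHI, a GHI-by-temperature
# interaction, and the same month-level random intercepts (not fit here)
formula_ext <- bf(
  log_energy ~ s(GHI_scaled) + temp_scaled + clouds_scaled +
    GHI_scaled:temp_scaled + (1 | month_factor)
)
# Short-term temporal dependence could be added with an autoregressive
# term in the formula, e.g. log_energy ~ ... + ar(p = 1)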
Despite these limitations, the present hierarchical Bayesian model already demonstrates how modern Bayesian methods can be used to analyze renewable energy production, quantify uncertainty, and generate probabilistic forecasts that are directly useful for planning and decision-making.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC.
Bürkner, P.-C. (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01
R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432.
Kaggle. (2025). Renewable Energy and Weather Conditions [Data set]. Retrieved from https://www.kaggle.com/datasets/samanemami/renewable-energy-and-weather-conditions
This analysis and report were partially assisted by generative AI (ChatGPT, OpenAI). All modeling decisions, interpretation, and verification were reviewed and finalized by the author.